Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec 01 21:43:12 2021"

About me

I have used R and RStudio quite a lot in my work. I’m familiar with Git and Github, but have not become ‘fluent’ with them yet – in fact, using them has always felt a bit cumbersome. I hope this course will help me get better with using Git/GitHub.

I heard about the course from my supervisor and department mailing list.

Here is a link to my GitHub page for this project: https://github.com/jpkos/IODS-project

List test:

  • First item
  • Second item

Chapter 2: Linear regression practice

In this exercise, we are using linear regression to evaluate how different learning approaches affect students’ performance in a statistics course.

We will start off by loading the libraries:

library(dplyr)
library(GGally)
library(ggplot2)

Read and describe the data

In the data wrangling section, we created a csv file with the data. Let’s load that.

data <- read.csv("data/learning2014_csv.csv")

Pick only the combined columns and the columns with participant info

cols <- c("Age", "Attitude", "Points", "gender", "deep", "surf", "stra")
data <- data[cols]

Data overview (dimensions [rows and cols] and structure [variables and their datatypes])

dim(data)
## [1] 166   7
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

The dataset is about how students’ learning approaches affect their exam points in a statistics course. The data has combined variables that describe the students’ learning approaches (deep, surface, strategic) and individual variables with participant information (age, gender, attitude towards statistics, exam points).

Summary of the variables:

summary(data)
##       Age           Attitude         Points         gender         
##  Min.   :17.00   Min.   :14.00   Min.   : 7.00   Length:166        
##  1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   Class :character  
##  Median :22.00   Median :32.00   Median :23.00   Mode  :character  
##  Mean   :25.51   Mean   :31.43   Mean   :22.72                     
##  3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75                     
##  Max.   :55.00   Max.   :50.00   Max.   :33.00                     
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000

Plot variable pairs to visualize correlations

p <- ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
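To complement the plot, the exact correlation coefficients can also be computed with base R’s `cor()` (a quick sketch, assuming `data` is the data frame loaded above):

```r
# Correlation matrix of the numeric variables (gender is excluded)
num_cols <- c("Age", "Attitude", "Points", "deep", "surf", "stra")
round(cor(data[num_cols]), 2)
```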

Some notes about the variables:

  • No major differences between genders
  • Biggest correlations between exam points and the strategic and surface approaches (but opposite signs)
  • Surface approach had negative correlation with attitude, whereas deep approach had a positive
  • Age correlated in opposite directions with the surface and strategic approaches
  • However, the age distribution showed that most participants were around the same age (early to mid 20s)

Linear regression models

Let’s fit a model with three variables (strategic approach, surface approach, attitude) and the exam points as the target variable

model <- lm(Points ~ stra + surf + Attitude, data=data)
summary(model)
## 
## Call:
## lm(formula = Points ~ stra + surf + Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Attitude has a significant effect; the others do not. Let’s fit another model, this time with Attitude only.

model2 <- lm(Points ~ Attitude, data=data)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The result indicates that students’ attitude has a strong correlation with their exam points. A one-point increase in the Attitude score increased a student’s exam points by about 0.35 (t = 6.214, p < 0.001).

The multiple R-squared measures how well the model fits the data: it is the proportion of variance in the response explained by the model (the adjusted R-squared additionally penalizes the number of fitted variables). The multiple R-squared for the first model was 0.2074, meaning that the model explains roughly 20.7% of the variance in the data. When we fit the model again with only Attitude as the predictor, the R-squared is 0.1906 (~19.1%), so Attitude alone explained most of the variance captured by the first model.
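These values can also be extracted programmatically from the model summaries (a sketch, assuming `model` and `model2` are the fits above):

```r
# R-squared values straight from the summary objects
summary(model)$r.squared      # multiple R-squared of the full model
summary(model)$adj.r.squared  # adjusted R-squared of the full model
summary(model2)$r.squared     # R-squared of the Attitude-only model
```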

Diagnostic plots:

plot(model2, c(1,2,5))

In the residuals vs fitted plot, we can see that with larger values the residuals may be getting smaller, although the number of data points is also smaller. Overall there does not seem to be serious heteroscedasticity (residuals should not depend on the fitted values, i.e. they should be randomly distributed).

The QQ-plot shows some deviation from normality at the extreme values; however, the deviations are fairly small and probably do not violate the normality assumption (errors should be normally distributed).

The residuals vs leverage plot shows that while there are a few mild outliers, they most likely do not seriously affect the model results.
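As a numeric complement to the leverage plot, Cook’s distances can be checked directly (a sketch; the 4/n cutoff used here is only one common rule of thumb, not part of the course material):

```r
# Cook's distances for model2; flag observations above the 4/n rule of thumb
cooks <- cooks.distance(model2)
which(cooks > 4 / length(cooks))
```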

Finally, let’s plot points vs attitude.

plot(data$Attitude, data$Points)
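The fitted regression line can be overlaid on the same scatter plot (a small sketch using base graphics, assuming `model2` from above):

```r
# Scatter plot of exam points vs attitude with the fitted line from model2
plot(data$Attitude, data$Points, xlab = "Attitude", ylab = "Exam points")
abline(model2, col = "red")
```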


Chapter 3: Logistic regression practice

Let’s start by loading the necessary libraries

library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)

Dataset description

Load the data we processed in the data wrangling section (actually, I’m using the ready-made dataset)

data <- read.csv("data/student/alc.csv")

Print a summary of the data structure

str(data)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "yes" "no" "no" "no" ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : chr  "yes" "no" "yes" "yes" ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

The data describes each student’s school performance together with study-related and personal information such as absences (from school), failures (at school), age, family size, parents’ education level, etc. G1, G2 and G3 are the grades in three different periods for the given course subject (Portuguese or math). High alcohol use (high_use) is a binary variable (TRUE/FALSE).

Analysis

We will analyse how 4 different variables relate to students’ alcohol consumption. The dependent variable is high alcohol use (high_use), with the following independent variables and hypotheses:

  1. Absences: Students that have more absences are more likely to have high alcohol use
  2. Failures: Students with more failures are more likely to have high alcohol use
  3. Sex: Men are more likely to have higher alcohol use
  4. Romantic: People who are in a relationship are less likely to have high alcohol use

First, let’s plot the distributions of these variables:

cols <- c("sex", "failures", "absences", "romantic", "high_use")
gather(data[cols]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar()
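A simple cross-tabulation also gives a first look at hypothesis 3 (a quick sketch, assuming `data` is loaded as above):

```r
# Counts of high/low alcohol use by sex
table(sex = data$sex, high_use = data$high_use)
```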

Fit a model with these variables:

model <- glm(high_use ~ absences + failures + sex + romantic, data=data, family="binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences + failures + sex + romantic, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1132  -0.8415  -0.5849   1.0329   2.1106  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.86714    0.24303  -7.683 1.56e-14 ***
## absences     0.09406    0.02323   4.049 5.15e-05 ***
## failures     0.60568    0.20812   2.910  0.00361 ** 
## sexM         0.98393    0.24789   3.969 7.21e-05 ***
## romanticyes -0.24606    0.26635  -0.924  0.35558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 406.13  on 365  degrees of freedom
## AIC: 416.13
## 
## Number of Fisher Scoring iterations: 4

According to the results:

Hypotheses 1 - 3 were supported by the data. For hypothesis 4, the effect was negative as predicted, but the effect is not statistically significant.

The confidence intervals of these variables are:

confint(model)
## Waiting for profiling to be done...
##                   2.5 %     97.5 %
## (Intercept) -2.36326577 -1.4080742
## absences     0.05079132  0.1421351
## failures     0.20530644  1.0277095
## sexM         0.50384493  1.4776693
## romanticyes -0.77779991  0.2691722

We can also interpret the coefficients as odds ratios by applying the exponential function to them

coef(model) %>% exp
## (Intercept)    absences    failures        sexM romanticyes 
##   0.1545656   1.0986256   1.8324980   2.6749408   0.7818748

Now the coefficients indicate, for example, that being male multiplies the odds of high alcohol use by about 2.675, and so on.
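The odds ratios and their confidence intervals can be combined into a single table by exponentiating both (a sketch, assuming `model` is the fit above):

```r
# Odds ratios with 95% confidence intervals (exponentiated coefficients)
OR <- exp(cbind(OR = coef(model), confint(model)))
OR
```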

Let’s compare predicted vs real results with a 2x2 table

predictions <- as.factor(predict(model, newdata = data, type = "response")>0.5)
true = as.factor(data$high_use)
confusionMatrix(data=predictions, reference=true)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   247   77
##      TRUE     12   34
##                                           
##                Accuracy : 0.7595          
##                  95% CI : (0.7126, 0.8021)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.006538        
##                                           
##                   Kappa : 0.3122          
##                                           
##  Mcnemar's Test P-Value : 1.169e-11       
##                                           
##             Sensitivity : 0.9537          
##             Specificity : 0.3063          
##          Pos Pred Value : 0.7623          
##          Neg Pred Value : 0.7391          
##              Prevalence : 0.7000          
##          Detection Rate : 0.6676          
##    Detection Prevalence : 0.8757          
##       Balanced Accuracy : 0.6300          
##                                           
##        'Positive' Class : FALSE           
## 

Total number of inaccurate predictions: 77 + 12 = 89. The total number of observations was 370, so 89/370 ≈ 0.2405, i.e. about 24% of the data was predicted with the wrong label.

To see if this is a good result, we consider what could be achieved by chance. If we just flipped a coin, we would get about 50% correct, and the logistic regression model is clearly better than that. However, the proportion of non-high users is much larger than the proportion of high alcohol users: there are 259 non-high users versus 111 high users. Thus, if we had a model that predicted everyone as a non-high user, we would get 70% correct. The model is therefore actually not that much better than the trivial model “no one is a high user”.
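The same comparison can be computed directly (a sketch, assuming `model` and `data` are as above):

```r
# Misclassification rate of the model vs. the "no one is a high user" baseline
probs <- predict(model, type = "response")
loss_model <- mean((probs > 0.5) != data$high_use)
loss_baseline <- mean(data$high_use)  # error if we predict FALSE for everyone
c(model = loss_model, baseline = loss_baseline)
```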

Chapter 4: Clustering and classification practice

As usual, let’s start by loading the libraries. Since I’m not sure which ones I’ll end up using, I will load all those we have used so far:

library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
library(MASS)

Dataset

Load the Boston dataset from the MASS package

data("Boston")

Explore the data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset describes housing values in Boston and includes several variables that could potentially affect them. These include: crime rate (crim), average number of rooms per dwelling (rm), nitrogen oxide concentration (nox), property tax (tax), proportion of black residents (black), and so on.

Let us next create graphical plots of the dataset to visualize correlations:

ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

The strongest correlations with median housing value are:

  • lstat (lower status of the population): -0.738. More lower class (less educated, poor employment) population will decrease housing prices.
  • rm (average number of rooms): 0.695. More rooms will increase price.
  • ptratio (pupil-teacher ratio): -0.508. Bigger P-T ratios (pupils per teacher) decrease prices.
  • indus (proportion of non-retail business acres): -0.484. More land reserved for businesses will decrease housing value.

Each of the 4 strongest correlations makes sense intuitively.
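The correlations with medv can also be listed directly instead of reading them off the plot (a quick sketch):

```r
# Correlations of every variable with median housing value, in increasing order
sort(cor(Boston)[, "medv"])
```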


Some notes about this summary:

  • age: this is the % of houses built before 1940. The median is 77.5%, so most of the houses are pre-1940 (the data was collected in the 1970s, so they were not that old at the time)

Analysis

LDA

Scale the dataset

Boston <- data.frame(scale(Boston))
summary(Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now each variable has mean 0 and standard deviation 1. Next, we will categorize the crime rate variable into quartiles

quants <- quantile(Boston$crim)
crimerate <- cut(Boston$crim, breaks=quants, include.lowest=TRUE)
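It is worth checking that the resulting categories are roughly equal in size (a quick sketch, assuming `crimerate` as above):

```r
# Counts per crime rate category; quartile breaks should give roughly equal classes
table(crimerate)
```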

Drop the continuous crime rate from the original dataset and add the categorical version

Boston2 <- Boston[,!(names(Boston) %in% c("crim"))]
Boston2$crim_cat <- crimerate

Divide into train/test datasets with an 80/20 split

train_ind <- sample(nrow(Boston2), nrow(Boston2)*0.8)
Boston2.train <- Boston2[train_ind,]
Boston2.test <- Boston2[-train_ind,]
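Since `sample()` is random, the split (and all downstream results) changes between knits. For reproducibility the seed could be fixed first (a sketch; the seed value 2021 is arbitrary):

```r
set.seed(2021)  # arbitrary seed, only to make the split reproducible
train_ind <- sample(nrow(Boston2), nrow(Boston2) * 0.8)
Boston2.train <- Boston2[train_ind, ]
Boston2.test  <- Boston2[-train_ind, ]
```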

Fit Linear Discriminant Analysis (LDA)

lda.model <- lda(crim_cat ~., data=Boston2.train)

Plot a bi-plot with arrows. Let’s use the example function provided on DataCamp:

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

Plot with the fitted model

cats <- as.numeric(Boston2.train$crim_cat)
plot(lda.model, dimen=2, col=cats, pch=cats)
lda.arrows(lda.model)

Save the categorical crime variable from the test set and remove it

cats.test <- Boston2.test$crim_cat
Boston2.test <- dplyr::select(Boston2.test, -crim_cat)

Now predict the categories for the test data and cross-tabulate

cats.pred <- predict(lda.model, newdata = Boston2.test)
table(correct = cats.test, predicted = cats.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              19             10               3              0
##   (-0.411,-0.39]                3             17               6              0
##   (-0.39,0.00739]               0              5              17              1
##   (0.00739,9.92]                0              0               0             21

Best results were achieved for the highest crime rate category (0.00739–9.92), where all 21 test observations were classified correctly. The lowest category had the most wrong predictions (19 correct, 13 wrong, i.e. about 41% misclassified), most of them predicted into the second lowest category. Note that since the train/test split is random, these exact numbers vary between knits.
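The overall accuracy can be read off the diagonal of the cross-tabulation (a sketch, assuming `cats.test` and `cats.pred` as above):

```r
# Proportion of test observations on the diagonal, i.e. predicted correctly
tab <- table(correct = cats.test, predicted = cats.pred$class)
sum(diag(tab)) / sum(tab)
```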

K-means

Next we will explore the data using clustering methods, namely the k-means clustering approach. First, fit the model using k=4:

km.model <- kmeans(Boston, centers=4)

Plot results

pairs(Boston, col=km.model$cluster)

We can choose the number of clusters by looking at how the total within-cluster sum of squares drops as k increases, and picking the k where the drop levels off. Using the code from DataCamp:

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

There is a sharp drop at k=2, and after that the drops are more gradual. We can plot kmeans with k=2.

km.model2 <- kmeans(Boston, centers=2)

Plot

pairs(Boston, col=km.model2$cluster)

It seems that 2 is indeed a good number of clusters; visual inspection shows no clear outliers (i.e. no possible third cluster that sticks out)

Chapter 5: Dimension reduction practice

As usual, let’s start with loading the libraries:

library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
library(MASS)
library(reshape2)

Dataset description

Load the data we processed in the data wrangling section (actually, I’m using the ready-made dataset)

data <- read.csv("data/human/human2.txt")

Print a summary of the data structure

str(data)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The dataset includes human development metrics, such as life expectancy, GNI and education level, for different countries. Definitions for some metrics:

  • Maternal mortality (Mat.Mor): maternal deaths per 100 000 live births
  • Adolescent birth rate (Ado.Birth): number of births to women 15–19 years of age, per 1000 women in the same age group
  • Female parliamentary representation (Parli.F): percentage of women in the parliament
  • Female to male ratio in secondary education (Edu2.FM): proportion of women with secondary education divided by the proportion of men

Plot variable pairs to visualize correlations

p <- ggpairs(data, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The strongest correlations can be seen in

  1. Life expectancy and maternal mortality (-0.857): countries with high maternal mortality have lower life expectancy
  2. Adolescent birth rate and maternal mortality (0.759): countries with many 15–19-year-old mothers have higher maternal mortality
  3. Maternal mortality and education level (-0.736): countries with a lower education level have higher maternal mortality
  4. Education level and life expectancy (0.624): countries with a more highly educated population have longer life expectancy.

Scale the data

data.scaled <- data.frame(scale(data))

Analysis

Now we shall run the principal component analysis (PCA) on the data. First with the non-scaled data.

pca.original <- prcomp(data)

Draw a biplot

biplot(pca.original, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Then again, this time for the scaled data

pca.scaled <- prcomp(data.scaled)

and plot

biplot(pca.scaled, choices = 1:2)

The results look very different. This is because the different metrics were measured in different units. The units are somewhat arbitrary (you can measure maternal mortality per 100 000 people or per 1 million people), so the raw variables cannot be compared directly (apples and oranges). When we standardize the data, all metrics are put on the same scale (mean 0, standard deviation 1), after which we can better compare them.
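The proportion of variance captured by each component can be checked from the PCA summary (a sketch, assuming `pca.scaled` as above):

```r
# Proportion of variance explained by each principal component
round(summary(pca.scaled)$importance["Proportion of Variance", ], 3)
```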

When we look at the principal components in the last figure, we can see some patterns:

  • When we move to the left on the first principal component and up on the second component, we see Nordic countries that are highly developed in many areas.

  • When we move down along PC2, we see countries that are wealthy but may have issues with gender equality (low female representation in parliament and in the labor force)

  • When we move to the right, we see countries that are less wealthy and have problems with maternal mortality and adolescent births.

  • The lower right-hand corner has countries that are poor, have gender inequality, and suffer from the abovementioned problems connected to childbirth.

Tea dataset

Load the tea dataset:

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
data(tea)

Summary of the data:

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

Structure:

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Dimensions:

dim(tea)
## [1] 300  36

The dataset has 300 rows and 36 columns. The columns (variables) describe people’s tea-drinking habits. Most of the variables are categorical, and most of those have two levels, such as sugar use (sugar vs. No.sugar).
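The claim about two-level factors can be checked by counting the levels of each column (a quick sketch, assuming `tea` is loaded as above; `nlevels()` returns 0 for the non-factor `age` column):

```r
# Count the number of factor levels per column and tabulate the counts;
# most columns should show 2 levels
level.counts <- sapply(tea, nlevels)
table(level.counts)
```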

Do multiple correspondence analysis and visualize

MCA(X=tea[, -which(names(tea) %in% c('age'))], method="indicator")

## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 35 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"
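
The call above only prints and plots the results; to work with them further, it is handier to store the return value (a sketch; `mca.res` is a name introduced here, and `graph = FALSE` suppresses the default plots):

```r
# Store the MCA result instead of only printing it, then inspect
# the eigenvalues (variance explained per dimension)
mca.res <- MCA(X = tea[, -which(names(tea) %in% c("age"))],
               method = "indicator", graph = FALSE)
head(mca.res$eig)
```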

In the plot, we see the different variable levels drawn as points on two axes. The distance between the categories tells us how similar they are. Two of the lowest points in the graph are the age group 15-24 and student, meaning that these two levels are similar (young people tend to be students). We can also see some not-so-obvious patterns:

  • Black, green and Earl Grey teas sit in completely different parts of the plot. This indicates that subjects’ characteristics (how they drink tea and how they perceive it) are associated with the type of tea they drink. For example, older people prefer black tea and younger people prefer Earl Grey; non-workers are closer to black tea and workers to green tea.

  • Older people prefer unpackaged tea, younger people use packaged tea more often.

  • Workmen prefer cheap tea, older people prefer upscale tea.

  • Young people and/or students are clearly different from other groups.

  • People who drink tea 1 to 2 times per week are more likely to use sugar than those who drink 3 to 6 times per week. Older people tend not to use sugar, while workers are more likely to.

Chapter 6: Longitudinal data analysis practice

Load libraries

library(reshape2)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step

Analysis of RATS data

We will start by analysing the RATS dataset using the analysis presented in chapter 8 from the MABS book.

Data

Load data

catcols.RATS <- c("ID", "Group", "Time")
time.order <- c("WD1", "WD8", "WD15", "WD22", "WD29", "WD36", "WD43","WD44", "WD50", "WD57", "WD64")
RATS <- read.csv("data/chapter5/RATS.csv")
RATS[catcols.RATS] <- lapply(RATS[catcols.RATS], factor)
RATS <- RATS[-1]
RATS$Time <- factor(RATS$Time, levels=time.order)

Print summary to see what type of data we have:

summary(RATS)
##        ID      Group       Time        value      
##  1      : 11   1:88   WD1    :16   Min.   :225.0  
##  2      : 11   2:44   WD8    :16   1st Qu.:267.0  
##  3      : 11   3:44   WD15   :16   Median :344.5  
##  4      : 11          WD22   :16   Mean   :384.5  
##  5      : 11          WD29   :16   3rd Qu.:511.2  
##  6      : 11          WD36   :16   Max.   :628.0  
##  (Other):110          (Other):80

We have four columns: ID, Group, Time and value. ID, Group and Time are categorical.

Next, visualize data:

ggplot(data=RATS, aes(x=Time, y=value, col=ID)) + geom_point()

It looks like the IDs split into two clusters: those below 300 and those above 400. The cluster above 400 shows a slight increase over time, while the below-300 cluster stays more constant. The IDs also appear correlated: when one ID increases, the others tend to increase too, and when one decreases, the others do the same, with no clear delay. Between WD29 and WD44 there was little to no change, after which the increase continued. There are also individual differences between IDs; some increase more than others. We can also plot by group:

ggplot(data=RATS, aes(x=Time, y=value, col=Group)) + geom_point()

Now more patterns emerge. The IDs below 300 all belong to the same group (Group 1). Group 2 has the second-largest values, but one subject (ID 12) is an outlier; it is possible that this ID was mislabeled.
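The grouping can be quantified by computing the mean value per group at each time point (a sketch, assuming `RATS` is loaded as above):

```r
# Mean value per Group and Time; Group 1 should stay in the below-300
# cluster while Groups 2 and 3 start above 400 and grow over time
group.means <- aggregate(value ~ Group + Time, data = RATS, FUN = mean)
head(group.means)
```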

Analysis of BPRS data

Data

Load data

catcols.BPRS <- c("treatment", "subject", "week")
week.order <- c("week0","week1","week2","week3","week4","week5","week6","week7","week8")
BPRS <- read.csv("data/chapter5/BPRS.csv")
BPRS[catcols.BPRS] <- lapply(BPRS[catcols.BPRS], factor)
BPRS <- BPRS[-1]
BPRS$week <- factor(BPRS$week, levels=week.order)

Visualize. First plot by subject:

ggplot(data=BPRS, aes(x=week, y=value, col=subject)) + geom_point()

There is quite a lot of variation both between and within participants, so it is hard to see any clear patterns in this data. One thing we do note is that the values seem to decrease until week 5 or 6, after which the decrease stops and the values even increase for some participants.

Then let’s look at the values by treatment, it will probably be more interesting:

ggplot(data=BPRS, aes(x=week, y=value, col=treatment)) + geom_point()

Even with the treatment grouping the picture remains noisy: the values of the two treatments overlap heavily, so it is hard to spot a treatment effect from this plot alone.

Because we are considering the effects of different treatments, it might be also fruitful to compare data at the beginning and at the end of the experiment. Let’s plot a boxplot, first by subject:

ggplot(data=subset(BPRS, BPRS$week %in% c("week0", "week8")), aes(x=week, y=value, col=subject)) + geom_boxplot()

From this it’s again hard to see any clear trend, except that the values seem to decrease by week 8 for almost all subjects.

Then plot by treatment:

ggplot(data=subset(BPRS, BPRS$week %in% c("week0", "week8")), aes(x=week, y=value, col=treatment)) + geom_boxplot()

Now the decrease over time is even clearer. However, there seems to be no difference between the treatments.
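As a quick cross-check of the visual impression (a sketch, not part of the original analysis), one could compare the treatments at the final week with a two-sample t-test:

```r
# Compare the treatment groups at week 8 only; a large p-value would be
# consistent with the boxplot showing no treatment difference
t.test(value ~ treatment, data = subset(BPRS, week == "week8"))
```

Note that this ignores the repeated-measures structure, which is exactly why the mixed model below is the more appropriate tool.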

Fit linear mixed effects model. The model is defined as follows:

  • We have several measurements from the same subjects, so we use a random intercept for subject
  • We want to see if the treatment affected the values, so treatment will be a fixed effect
  • The value could also decrease over time, so we use week as another fixed effect, together with the treatment-by-week interaction
  • We will start by considering only the first and last week’s data
lmer.m1 <- lmer(value ~ treatment*week + (1|subject), data=subset(BPRS, BPRS$week %in% c("week0", "week8")))
summary(lmer.m1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: value ~ treatment * week + (1 | subject)
##    Data: subset(BPRS, BPRS$week %in% c("week0", "week8"))
## 
## REML criterion at convergence: 618.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9188 -0.6116 -0.1409  0.5151  2.8695 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  27.77    5.27   
##  Residual             149.01   12.21   
## Number of obs: 80, groups:  subject, 20
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            47.000      2.973  70.762  15.809  < 2e-16 ***
## treatment2              2.000      3.860  57.000   0.518    0.606    
## weekweek8             -17.700      3.860  57.000  -4.585 2.53e-05 ***
## treatment2:weekweek8    2.250      5.459  57.000   0.412    0.682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) trtmn2 wekwk8
## treatment2  -0.649              
## weekweek8   -0.649  0.500       
## trtmnt2:wk8  0.459 -0.707 -0.707

At the beginning, treatment 2 was 2.00 points higher than treatment 1, but the difference was not significant (t = 0.518, p = 0.606). The values decreased over time: by week 8 the treatment 1 value was 17.70 points lower than at the beginning (t = -4.585, p < 0.001). For treatment 2, the week 8 value was only 2.25 points higher than for treatment 1, about the same difference as at the beginning (t = 0.412, p = 0.682). Therefore the treatment does not seem to have had any effect.

The standard deviation of the random effect of subject (intercept) was 5.27, so there was some difference between subjects.

Based on this analysis, the treatment did not have any effect. This is in line with what we saw in the graphical data exploration. Most of the decrease in values comes from time.

We can print the random coefficients:

coef(lmer.m1)
## $subject
##    (Intercept) treatment2 weekweek8 treatment2:weekweek8
## 1     50.85972          2     -17.7                 2.25
## 2     44.56031          2     -17.7                 2.25
## 3     48.72433          2     -17.7                 2.25
## 4     47.22955          2     -17.7                 2.25
## 5     51.18003          2     -17.7                 2.25
## 6     43.27908          2     -17.7                 2.25
## 7     52.03419          2     -17.7                 2.25
## 8     49.47172          2     -17.7                 2.25
## 9     42.95877          2     -17.7                 2.25
## 10    50.64618          2     -17.7                 2.25
## 11    53.42219          2     -17.7                 2.25
## 12    45.73478          2     -17.7                 2.25
## 13    45.94832          2     -17.7                 2.25
## 14    44.02647          2     -17.7                 2.25
## 15    49.25818          2     -17.7                 2.25
## 16    45.73478          2     -17.7                 2.25
## 17    45.84155          2     -17.7                 2.25
## 18    42.53169          2     -17.7                 2.25
## 19    42.74523          2     -17.7                 2.25
## 20    43.81293          2     -17.7                 2.25
## 
## attr(,"class")
## [1] "coef.mer"

This shows the random intercepts for participants, with respect to the intercept shown in the model summary.
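The subject-specific deviations themselves (rather than the intercept-plus-deviation values shown by `coef()`) can be extracted with `ranef()`:

```r
# Random effects as deviations from the fixed intercept; for each subject,
# coef() intercept = fixef() intercept + ranef() deviation
ranef(lmer.m1)
```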

Perhaps the treatment has a different effect on each subject? After all, the plots showed that the subjects differed quite a lot. Furthermore, let’s add the rest of the weeks to the analysis; for that we first convert the week labels to numbers:

BPRS$week_num <- as.numeric(substr(BPRS$week, 5, 8))

We can then add a random slope for each subject:

lmer.m2 <- lmer(value ~ treatment*week_num + (1+week_num|subject), data=BPRS)
summary(lmer.m2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: value ~ treatment * week_num + (1 + week_num | subject)
##    Data: BPRS
## 
## REML criterion at convergence: 2724.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0382 -0.6240 -0.0748  0.5322  3.9144 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 69.272   8.323         
##           week_num     1.057   1.028    -0.52
##  Residual             97.076   9.853         
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                     Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)          47.8856     2.3016  27.7107  20.805  < 2e-16 ***
## treatment2           -2.2911     1.9150 318.0005  -1.196   0.2324    
## week_num             -2.6283     0.3657  38.6133  -7.187 1.26e-08 ***
## treatment2:week_num   0.7158     0.4022 318.0005   1.780   0.0761 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) trtmn2 wek_nm
## treatment2  -0.416              
## week_num    -0.649  0.462       
## trtmnt2:wk_  0.350 -0.840 -0.550

This did not change the results markedly. Treatment 2 had a 2.29 points lower value (t = -1.196, p = 0.232). Each week the value decreased by 2.63 points (t = -7.187, p < 0.001). The treatment still does not seem to have any effect, although the treatment-by-week interaction approaches significance (t = 1.780, p = 0.076).

Here are the coefficients:

coef(lmer.m2)
## $subject
##    (Intercept) treatment2  week_num treatment2:week_num
## 1     49.07248  -2.291111 -1.223339           0.7158333
## 2     47.05039  -2.291111 -3.280383           0.7158333
## 3     47.72686  -2.291111 -3.211386           0.7158333
## 4     49.83813  -2.291111 -2.437827           0.7158333
## 5     66.68207  -2.291111 -4.342611           0.7158333
## 6     42.56109  -2.291111 -2.550753           0.7158333
## 7     54.71328  -2.291111 -3.500714           0.7158333
## 8     47.79551  -2.291111 -1.530085           0.7158333
## 9     42.09978  -2.291111 -2.985474           0.7158333
## 10    53.35978  -2.291111 -2.552101           0.7158333
## 11    59.93477  -2.291111 -1.927352           0.7158333
## 12    48.45376  -2.291111 -2.880199           0.7158333
## 13    50.43280  -2.291111 -3.122746           0.7158333
## 14    40.84752  -2.291111 -2.186925           0.7158333
## 15    55.19374  -2.291111 -3.407935           0.7158333
## 16    44.45802  -2.291111 -2.947707           0.7158333
## 17    43.78081  -2.291111 -1.592874           0.7158333
## 18    36.66185  -2.291111 -1.757613           0.7158333
## 19    38.07845  -2.291111 -2.472162           0.7158333
## 20    38.97001  -2.291111 -2.656482           0.7158333
## 
## attr(,"class")
## [1] "coef.mer"

We can see that there was quite a lot of variation both in the subjects’ baseline levels (intercepts) and in how steeply their values decreased each week (week_num slopes).
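To check formally whether the random slope for `week_num` is needed, the random-slope model could be compared against a random-intercept-only model fitted to the same full data (a sketch; `lmer.m3` is a name introduced here):

```r
# Random-intercept-only model on the full data
lmer.m3 <- lmer(value ~ treatment * week_num + (1 | subject), data = BPRS)
# Likelihood-ratio test; anova() refits both models with ML before comparing
anova(lmer.m3, lmer.m2)
```

A small p-value in this comparison would indicate that allowing subject-specific slopes improves the fit.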